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An  Efficient  Parallel  Finite-Element-Based  Domain  Decomposition 
Iterative  Technique  With  Polynomial  Preconditioning 

Yu  Liang  *  Ramdev  Kanapady  '  and  Kumar  K.  Tamma  * 


Abstract 

An  efficient  parallel  finite  element-based  domain  decomposition  iterative  technique  with  polynomial  precondi¬ 
tioning  with  particular  attention  to  the  GMRES  solver  is  presented.  Unlike  the  standard  row-oriented  partitioning 
of  a  matrix,  finite  element  based  domain  decomposition  with  polynomial  preconditioning  circumvents  the  assembly 
of  matrix,  reordering  of  matrix,  redundant  computations  associated  with  the  interface  elements,  numerical  problems 
associated  with  local  preconditioner,  and  costly  global  preconditioner  construction.  A  dramatic  reduction  in  parallel 
overhead  both  in  terms  of  computation  and  communication  results  in  a  highly  scalable  solver.  The  parallel  perfor¬ 
mance  results  for  large-scale  static  and  dynamic  problems  on  the  IBM  SP2  and  the  SGI  Origin  are  presented. 

Keywords:  GMRES,  domain  decomposition,  distributed  format,  polynomial  preconditioner,  finite  element,  linear 
equations. 


1  Introduction 

Parallel  iterative  solvers  are  better  alternative  than  the  parallel  direct  solvers  for  large-scale  finite  element  computations 
[1,2].  Numerically  scalable  domain  decomposition  based  solvers  such  as  finite  element  tearing  and  interconnecting 
(FETI)  method  [3,4]  are  better  suited  for  highly  parallel  finite  element  computations.  This  however  is  mainly  restricted 
to  symmetrical  system  arising  from  finite  element  formulations  for  solid  and  structural  mechanics  based  applications. 
The  iterative  solvers  such  as  GMRES  method  [5]  are  an  effective  method  for  iterative  solution  of  specific  classes  of 
un-symmetric  and  symmetric  linear  systems  arising  from  finite  element  formulations  [1],  The  rate  of  convergence 
of  the  iterative  algorithms  is  governed  by  the  condition  number  of  the  linear  systems,  which  increases  dramatically 
for  finite  element  matrices  as  the  mesh  size  decreases.  As  a  result,  an  efficient  preconditioner  is  required  for  the 
solution  of  the  linear  systems.  Finite  element-edge  based  domain  decomposition  (DD)  provides  an  intuitive  while 
powerful  framework  for  the  parallel  formulation  of  finite  element  method  (FEM)  [6].  Within  this  DD  framework 
providing  an  efficient  preconditioner  for  an  iterative  solver  is  a  challenging  task.  The  Sparse  Approximate  Inverse 
(SPAI)  [7]  and  diagonal  preconditioners  are  commonly  used  preconditioner  for  parallel  formulations.  However,  SPAI 
is  not  suitable  for  finite  element  edge  based  domain  decomposition  strategies  [8]  and  diagonal  preconditioners  are  not 
effective  enough  to  reduce  the  number  of  iterations  for  large-scale  complex  problem.  Incomplete  LU  factorization 
(ILU(k))  [7]  based  preconditioners  are  effective  for  single  processors  computational  framework.  However,  for  parallel 
framework,  its  use  is  mainly  limited  as  it  incurs  high  parallel  communication  overhead  similar  to  that  of  direct  solvers, 
and  in  particular  for  the  finite  element  edge-based  DD  framework.  The  polynomial  preconditioners  [9]  are  simple, 
yet  efficient  and  effective  methods  for  efficient  parallel  iterative  solvers.  Polynomial  preconditioning  methods  has 
been  investigated  in  literature  [5,7,9-16],  In  parallel  framework,  polynomial  preconditioning  methods  are  mainly 
associated  with  row  partitioning  of  the  matrix  of  the  linear  system. 

In  this  paper,  an  efficient  parallel  finite  element-based  domain  decomposition  GMRES  solver  in  conjunction  with 
polynomial  preconditioning  is  presented.  The  distinguishing  features  of  the  proposed  approach  is  that  it  circumvents: 
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i)  the  assembly  of  the  finite  element  coefficient  matrix  and  preconditioner  associated  with  subdomain  interface,  ii) 
reordering  of  a  matrix  to  gain  parallel  performance,  iii)  redundant  computations  associated  with  the  interface  elements, 
and  iv)  numerical  problems  associated  with  local  ILU(k)  such  as  singularity  preconditioner.  This  results  in  a  dramatic 
reduction  in  the  parallel  overhead  both  in  terms  of  computation  and  communication  and  results  in  a  highly  scalable 
solver.  The  flexible  GMRES  with  Neumann  series  and  generalized  least-squares  polynomial  preconditioner  [15]  are 
employed  here  for  large-scale  static  2nd  order  elasticity  and  elastodynamics  problems  on  an  IBM  SP2  and  SGI  Origin 
to  illustrate  the  effectiveness  and  parallel  performance  of  the  proposed  approaches. 

The  remainder  of  this  paper  is  organized  as  follows.  In  Section  2  the  sequential  preconditioning  techniques  with  fo¬ 
cus  on  polynomial  preconditioning  methods  for  their  subsequent  parallel  formulation  is  presented.  Section  3  discusses 
the  parallel  implementation  of  a  finite  element  based  domain  decomposition  flexible  GMRES  solver  followed  by  the 
discussions  on  parallel  implementation  of  a  flexible  GMRES  solver  for  row-based  domain  decomposition  techniques 
in  Section  4.  In  Section  5  the  communication/computation  complexity  of  the  finite  element-based  and  node-based 
flexible  GMRES  is  briefly  discussed  to  highlight  the  advantages  of  the  finite  element-based  domain  decomposition 
based  solver.  In  Section  6,  the  convergence  and  parallel  performance  for  static  and  dynamic  problems  employing  the 
present  parallel  polynomial  preconditioning  flexible  GMRES  is  provided  followed  by  conclusions  in  Section  7. 

2  Preconditioned  Iterative  Solvers 

In  this  section,  the  commonly  used  preconditioning  techniques  with  particular  attention  to  the  polynomial  precon¬ 
ditioning  methods  are  presented.  The  polynomial  preconditioning  techniques  such  as  the  Neumann  series  and  the 
generalized  least-squares  (GLS)  methods  are  discussed.  The  construction  and  factors  affecting  the  numerical  stabil¬ 
ity  of  these  polynomial  preconditioners  are  also  discussed.  An  important  pre-processing  technique  such  as  diagonal 
scaling  to  effectively  apply  the  polynomial  preconditioning  is  also  discussed  which  is  subsequently  employed  in  the 
parallel  implementation  of  both  element-  and  node-based  partitioning  based  strategies. 

2.1  Preconditioning  methods 

Implicit  finite  element  computations  in  several  scientific  and  engineering  problems  such  as  linear/nonlinear,  static  or 
dynamic  problems  requires  the  resulting  solution  of  a  system  of  linear  equations  such  as 


Ku  =  f  (1) 

where  K  €  9vVxA:'  is  a  sparse  coefficient  matrix,  u  S  9tv  is  the  vector  of  unknowns  and  f  g  9vv  is  the  load  vector  and 
N  is  the  number  of  equations.  In  the  case  of  large-scale  problems,  domain  decomposition  (DD)  provides  an  intuitive 
yet  powerful  framework  for  the  parallel  formulation  of  the  finite  element  method  (FEM).  If  Q.  is  the  original  problem 
domain  which  is  partitioned  into  fP  sub-domains,  then  Eq.  1  can  be  written  as 


(2) 


where  T  is  the  number  of  processors,  [Jf  i  denotes  the  “assembly”  operation,  Ks  is  the  coefficient  matrix,  K?  and  u' 
is  the  solution  vector,  and  fs  is  the  right-hand  side  vector  associated  with  the  each  partition  .v.  Here  partition  could  be 
finite-element-node  based  or  finite-element-edge  based  partition.  The  former  leads  to  row-partition  of  the  associated 
matrix.  This  forms  the  basis  for  most  of  the  parallel  implementations  of  the  iterative  solvers  [7, 17, 18].  In  this  paper 
we  assume  the  finite  element-edge  based  parition  and  each  sub-domain  is  mapped  into  each  processor  involved  in  the 
computation. 

The  rate  of  convergence  of  the  iterative  algorithms  are  governed  by  the  condition  number  of  the  resulting  linear 
systems,  which  increases  dramatically  for  finite  element  matrices  as  the  mesh  size  decreases.  As  a  result,  an  efficient 
preconditioner  is  required  for  the  solution  of  the  linear  systems  in  Eq.  1.  If  C  is  a  preconditioner,  then  Eq.  1  can  be 
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transformed  into 


Cz.Ku 

KCflV 

CLKCffv 


C/f, 

f.u  =  Cry,  or 
CLf.u  =  Csv 


(3) 

(4) 

(5) 


where  Cl,  C r  are  the  preconditioning  matrices  such  that  ClK,  KC«  and  C/KCr  are  approximately  equal  to  the  iden¬ 
tity  matrix  I.  The  Eqs.  3,  4  and  5  are  called  left-,  right-  and  split-preconditioners  [7]  respectively.  In  this  paper,  our 
attention  is  restricted  to  left-  and  split-preconditioners.  Serial  implementation  of  the  commonly  employed  precon¬ 
ditioners  are  incomplete  LU  factorization  (ILUfA;),  where  k  is  the  level  of  fill-in).  For  high  parallelizability.  Sparse 
Approximate  Inverse  (SPAI)  and  diagonal  preconditioners  are  commonly  used.  However,  SPAI  is  not  suitable  for 
element-based  domain  decomposition  strategies  and  diagonal  preconditioners  are  not  effective  enough  to  reduce  the 
number  of  iterations  for  large-scale  complex  problems. 

Diagonal  scaling  preconditioner  is  generally  employed  in  the  split-preconditioning  scheme  so  as  to  keep  the  sym¬ 
metry  of  linear  systems.  It  is  easy  to  construct  and  incur  little  communication  in  its  application.  However,  except 
for  a  few  cases  of  diagonally  dominant  linear  systems,  diagonal  scaling  preconditioner  cannot  effectively  reduce  the 
number  of  iterations.  Diagonal  scaling  operation  incur  no  communication  overhead  and  is  available  for  element-based 
domain  decomposition  (EDD)  method.  In  this  paper,  diagonal  scaling  preconditioner  is  utilized  as  an  indispensable 
pre-processing  tool  for  the  polynomial  preconditioner. 

The  interest  in  polynomial  preconditioners  [15]  is  motivated  by  the  need  for  simple,  yet  efficient  and  effective  methods 
for  efficient  parallel  iterative  solvers  [7]  for  high  performance  computers  (HPC)  such  as  vector  and  parallel  processors. 
In  contrast  to  ILU,  SPAI,  etc.  polynomial  preconditioners  are  constructed  only  based  on  the  the  matrix  spectrum  (i.e., 
eigvalues  of  the  coefficient  matrix),  which  is  denoted  as  o(K).  Even  though  o(K)  is  generally  difficult  to  compute, 
©,  an  approximate  estimation  to  it  can  be  easily  obtained.  When  K  is  a  symmetric  matrix,  then  0  C  91;  otherwise 
©  C  C ■  The  accuracy  of  ©  determines  the  rate  of  convergence  of  the  preconditioned  systems.  Since  polynomial 
preconditioning  will  always  keep  the  symmetry  of  linear  systems  [15],  in  this  paper  we  employ  left-preconditioning, 
Eq.  3,  to  transform  the  original  linear  systems,  Eq.  1,  to  yield 

Pm{K)Ku=Pm{K)f  (6) 

where  Pm ( K )  is  a  polynomial  in  K  with  degree  <  m.  It  has  been  shown  in  [15]  that,  if  Pm(X)  is  constructed  such  that 

min  |j  1  -~XPm(k)\\  (A,  €  ©)  (7) 

where  p,„  [©]  is  the  set  of  m-degree  polynomials  which  is  valid  over  ©,  and  j | .  |  represents  a  specific  norm,  namely,  the 
uniform  norm  or  quadratic  norm,  then  abound  on  the  condition  number  [7]  of  the  preconditioned  systems,  k(P,„(K)K), 
is  minimized. 

In  this  paper  the  coefficient  matrix  A  is  assumed  to  be  symmetric.  The  difficulty  in  applying  the  polynomial  pre¬ 
conditioner  lies  in  the  estimation  of  the  spectrum  of  coefficient  matrix.  This  can  be  easily  achieved  by  a  simple 
pre-processing  technique  that  maps  the  spectrum  of  the  coefficient  into  a  predetermined  interval.  This  is  discussed  in 
the  next  section. 

2.1.1  Diagonal  scaling 

Scaling  is  an  essential  pre -preprocessing  technique  applied  to  Eq.  1  before  the  polynomial  preconditioning  stage.  An 
appropriate  scaling  operation  is  essential  for  ready  application  of  the  polynomial  preconditioner.  Besides  reducing 
the  condition  number  of  linear  systems,  it  restricts  the  spectrum  of  the  coefficient  matrix  0(A)  into  a  pre-determined 
interval. 
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Neuman-Series  polynomial  preconditioner 


Figure  1 :  Residual  polynomial  of  the  Neumann-series  preconditioner. 

Theorem  1  Let  K=  [k  i ,  k2 ,  •  •  •  ,k;v]r  £3iNxN  be  an  irreducible  matrix  and  assume  that  A,max  is  the  largest  eigenvalue 
of  K,  then 

A-max  <  max  1 1  k,- 1 1 1 ,  k,  e  9FV  (8) 

i=  1 

where  ||k,-||i  =  Ylj=i  I  ki,j  I  its  called  the  discrete  L\  norm  of  vector  k,. 


For  the  proofs  of  the  above  theorem,  the  reader  is  referred  to  the  Gershgorin  theorem  [7].  Given  a  linear  system,  Eq.  1, 
consider  K  =  [ki  ,k2, •  •  •  ,kA?]r,  and 


D  =  dtag(  /-7-) 

V  “  t  v  on 

(9) 

where 

di  =  1 1  ki  ||  1  =  1  A;,;  1 

j=  1 

GO) 

Then,  the  Eq.  1  can  be  transformed  into 

Ax  =  b 

(11) 

where  A  =  DKD,  b  =  Df  and  x  =  Du.  According  to  Theorem  1,  it  follows  that 

o(A)  c  (0,1) 

(12) 

This  enables  the  construction  of  the  polynomial  preconditioner  for  Eq.l  1  such  that  ©  =  (0, 1).  In  the  following  two 
sub-sections,  the  Neumann  series  and  the  generalized  least-squares  preconditioning  polynomial  are  discussed. 
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Residual  Polynomial  on  (0,1) 


Residual  Polynomial  on  [-4,-1  ]U[7,1 0] 


Residual  Polynomial  on  (-6.0,-4.1)U(-3.9,-0.1)U(0.1,5.9)U(6.1,8.0) 


Figure  2:  Residual  polynomial  1  —XP,„(X)  for  various  ©  and  polynomial  degrees:  a)  ©  =  (0. 1 ,2.5  ),  b)  0  =  (-  4,  —  1)U 
(7,10),  and  c)  ©=  (-6.0,  -4.1)  U  (-3.9, -0.1)  U  (0.1, 5.9)  U  (6.1, 8,0). 

2.1.2  Neumann  series  polynomial  preconditioner 

The  Neumann  polynomial  preconditioner  is  the  simplest  polynomial  preconditioner  and  it  originates  from  the  basic 
algebraic  relation 

i 

— Y  =  V|A,|<l,A,e9l  (13) 

1  K  i= o 

The  following  theorem  may  be  used  in  the  construction  of  a  polynomial  approximation  to  the  inverse  of  the  coefficient 
matrix  A. 

Theorem  2  Let  G  £  9?^xjv,  then  Y£=oGk  converges  if  and  only  if  p(G)  <  1  where  p(G)  is  the  spectral  radius  of 
matrix  G.  Then,  it  readily  follows  that 

oo 

Y,Gk  =  (I-G)~1  (14) 

k= 0 

Let  coA  =  1(1  coA)  and  G  =  (I  —  coA)  where  to  £  91  is  the  scaling  factor.  Then,  from  Theorem  2,  the  inverse 
of  any  coefficient  matrix  A  is  given  by 

A-1  =0D(I  — G)-1.  (15) 

In  Eq.15  CO  can  be  adjusted  so  that  p(G)  <  1.  Then  it  follows  that  A-1  may  be  approximated  as  follows: 

oo 

A^1  =  co(£ G')  «  co(I  +  GH - h Gm_1)  (16) 

/=o 

Since  the  preconditioner  matrix  Pm( A)  «  A-1,  from  Eq.  16  it  follows  that,  the  preconditioner  Pm  (A)  is  given  by 

P,„(A)  =  co(I  +  G  +  G2H - h-G”')  (17) 

In  practical  implementations,  CO  would  be  determined  according  to  the  characteristics  of  A.  For  example,  if  A 
is  symmetric  positive  definite  then  one  can  select  ©  =  (0,  ~h).  Figure  1  shows  that  the  residual  polynomials  (1  — 
LP,n-  ]  (/V )),;;(  =  5,6,7  is  approximately  equal  to  zero  in  the  interval  Q.  =  (0,30). 
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2.1.3  Generalized  least-squares  polynomial  preconditioner 

The  generalized  least-squares  (GLS)  polynomial  preconditioning  is  preferable  to  other  polynomial  preconditioning 
methods  (i.e.,  Neumann  series,  least-squares,  Chebyshev  etc.)  due  to  its  high  application  flexibility  and  low  construc¬ 
tion  cost.  For  GLS  polynomial  preconditioning  method,  ©  is  defined  as  a  union  of  an  arbitrary  number  of  disjoint 
intervals  of  the  spectrum,  i.e., 


N, 

©  =  U  (4,10,0^©  and 

k=  1  ^1°-) 

l\  <  <  I2  <  ~h  <  ■  ■  ■  <  f/v/  <  ~kn 

where  Ni  is  the  number  of  intervals  in  which  the  eigenvalue  of  the  coefficient  matrix  K  lies.  Therefore,  the  GLS  method 
can  be  a  general  method  of  solving  symmetric  linear  systems  including  both  symmetric  indefinite  and  symmetric 
positive  definite  systems. 

Once  ©  is  determined,  the  GLS  polynomial  preconditioner  can  be  constructed  such  that 

min  ||  1  -XPm(X)\\w,  Xg&  (19) 

(Oji 

where  ||.||w  represents  the  weighted  quadratic  norm  induced  by  the  inner  product  ( f,g)w  =  and 

w(A)  is  a  non-negative  weight  function  over  the  interval  ©.  The  key  to  the  above  least-squares  problem,  Eq.  19,  is  the 
construction  of  a  sequence  of  orthogonal  polynomials.  Given  a  series  of  orthogonal  polynomial  sequence  { /,<),( /V) (l 
with  respect  to  the  inner-product  [15]  as  a  basis,  then  the  least-squares  problem,  Eq.  19,  can  be  readily  constructed 
via 

m 

Pm{X)  =  (20) 

(=0 

where  /r,-  =  ( 1 ,  Ax]),- (X))w.  As  a  result,  the  polynomial  preconditioning  can  be  computed  iteratively  as 

m 

Pm(  A)v  =  ^Pi<|>i(A)v.  (21) 

(=0 

In  this  work  the  orthogonal  polynomial  sequence  is  efficiently  constructed  using  Chebyshev  polynomials 

[15]  embedded  in  the  Stieltje  procedure  via  the  three-term  recurrence  relationship.  The  storage  and  time  cost  for  this 
operation  is  negligible  compared  to  ILU,  SPAI,  etc.  For  more  detailed  information  about  the  sequential  implementation 
of  the  polynomial  preconditioner,  the  reader  is  referred  to  [15]. 

Figure  2  (a),(b)  and  (c)  show  the  graphs  of  the  residual  1  — XPm(X )  for  the  GLS  polynomial  preconditioning  with 
respect  to  different  ©  and  m.  It  can  be  readily  shown  that,  if  ©  ss  o(  K),  then  Pm(K)K  «  I  leading  to  an  efficient 
preconditioner.  For  the  iterative  solvers,  the  polynomial  preconditioning  step 

z  =  Pm(A)y  (22) 

reduces  to  a  set  of  m  matrix-vector  product  operations  instead  of  incomplete  factorization  followed  by  backward 
and  forward  solve  operations.  In  such  a  case,  the  only  requirement  that  needs  to  be  satisfied  is  the  stability  of  the 
preconditioner,  which  is  discussed  in  the  following  section. 


2.2  Stability  of  polynomial  filtering 

Let  e  be  the  machine  precision  (roundoff  unit)  and  let 


Pm(  A)v 


(23) 
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Stability  of  Preconditioning  Polynomial 


Figure  3:  (l  |  a,  |  forGLS  preconditioning  polynomial. 

Let  Zfi  =  P,„( A)v  satisfy  the  floating  point  arithmetic,  and  z  =  Pm( A)v  be  the  exact  value.  Then  it  follows  from  [16], 

m 

||z//-z||2<me£  |a«- 1  .  (24) 

(=0 

The  inequality  Eq.  24  indicates  that  the  stability  of  the  preconditioning  operation  Pm  (A)v  mainly  depends  on  m,  the 
degree  of  the  preconditioning  polynomial  Pm  and  the  sum  of  the  absolute  values  of  the  coefficients  of  the  polyno¬ 
mial.  Figure  3  shows  the  norm,  Eq.  24,  of  various  Pm(k)  with  increasing  degree  of  polynomial  m  for  ©  (0,1) 

and  ©  =  (-4,-1)  U  (7, 10).  It  can  be  seen  that,  for  the  GLS  polynomial  preconditioning,  the  accumulated  error  in¬ 
creases  dramatically  as  the  degree  of  polynomial  increases.  It  is  clear  that  for  all  practical  purposes  the  degree  of  the 
polynomial  should  be  restricted  to  less  than  10. 

2.3  Sequential  flexible  GMRES 

In  this  paper,  the  generalized  minimal  residual  solver  is  chosen  due  to  its  general  applicability  to  a  wide  variety  of  ap¬ 
plications  including  problems  arising  in  numerical  formulations  of  structural  mechanics  applications.  To  formulate  the 
parallel  GMRES  for  the  EDD  strategy,  the  standard  (sequential)  description  is  given  first.  For  the  pre-processed  linear 
equations,  Eq.  11,  the  GMRES  solution  process  is  essentially  a  least-squares  process,  [7]  which  may  be  expressed  as 
follows: 

min  1 1  b  —  Ax.v  1 1 2  (25) 

xs=x0+A:  (A,r0) 

and  3Cs(A,ro)  =  span{ro,  Aro,  A2ro,  •  •  •  ,  Ai_1ro}  where  3G(A,ro)  is  an  s-dimensional  Krylov  subspace  [7],  and  ro  = 
b  —  Axo  is  the  initial  residual  employing  xo  as  the  initial  approximation  to  the  required  solution.  The  GMRES  solves 
Eq.25  by  constructing  an  orthonormal  basis  {vi,V2,---  ,Vs}  for  ?Cv(A,ro)  using  the  Arnoldi  algorithm  [7]  and  then 
finding  the  optimum  solution  based  on  this  basis.  In  this  paper  a  variant  of  the  GMRES  which  is  known  as  the 
flexible  GMRES  (FGMRES)  [7]  is  employed.  This  variant  permits  the  easy  construction  of  different  preconditioners 
at  required  stages  in  the  iterative  process  of  the  GMRES  solver.  FGMRES  differs  from  GMRES  in  that  the  solution 
updates  are  constructed  using  the  preconditioned  variable  z  j  instead  of  basis  vy.  The  sequential  implementation  of  the 
FGMRES  is  described  in  Algorithm  1 . 

Algorithm  1  Sequential  flexible  GMRES  with  restart  and  preconditioner 

(1)  Start:  choose  xo,  m 

(dimension  of  the  Krylov  subspace) , 
setup  (m+l)-by-ra  matrix  H. 

(2)  Arnoldi  process : 

(3)  r0  =  b  — Ax0;  PHM2;  v0  =  r0/p; 

(4)  FOR  j  =  0,  ■  •  •  ,m  —  1  DO  until  convergence 

(5)  Zj  =  Cvjj 
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(a)  finite  element  mesh  partition 


(b)  Associated  matrix  partitions 


Figure  4:  Non-overlapping  finite  element-based  domain  decomposition. 


(6) 

Vj+1  =Az  j 

(7) 

FOR  (i  =  0,  ■  •  ■  ,j ) 

Ki  =  <vj+i>v,); 

(8) 

FOR  (;  =  (),-■■  J) 

Wj  =Wj-hijvn 

(9) 

hj+i,j  =  Ily/+i  lU; 

vy+i  =  yi+i/hi+i,j 

GO) 

ENDFOR 

(11) 

Define  Z;  =  [z,v  ,Z/1 

and 

H;  =  {hi,k}  (0<i<k+l,0<k<j); 

(12)  Compute  the  approximate  solution: 

(13)  Xj=x0  +  Zjyj 

such  that  y j  =  argmaiy\\$ei  —  Hjy\\2\ 

(14)  Restart:  IF  (not  convergent)  xo  =Xj  and 

GOTO  (2); 


From  the  above  discussions  it  is  clear  that  the  polynomial  preconditioner  has  a  significant  potential  to  many  practical 
problems.  However,  for  it  to  be  readily  employed  for  finite  element  applications  involving  large  scale  problems 
an  efficient  parallel  implementation  is  required.  In  the  next  section,  an  efficient  implementation  of  the  polynomial 
preconditioning  based  GMRES  for  finite  element-based  partitioning  is  presented.  This  is  followed  by  the  parallel 
implementation  for  row-partitioning  of  the  matrix  that  results  due  to  the  finite  element  node -based  partitioning. 

3  Finite  element-based  domain  decomposition  based  GMRES 

In  this  section,  the  distributed  formulation  of  linear  systems,  Eq.  1,  arising  from  non-overlapping  finite  element-based 
domain  decomposition  strategy  is  discussed.  Typical  steps  involved  in  finite  element-based  domain  decomposition 
computation  are  listed  in  Algorithm  2  for  illustration.  The  time-consuming  kernels  in  iterative  methods  require  an 
efficient  implementation  which  depends  on  three  basic  operations.  These  are  vector  operations  namely,  vector-update 
and  inner-product,  matrix-vector  product,  and  the  preconditioning  solver  phase.  This  is  followed  by  norm-1  diagonal 
scaling  technique  in  conjunction  with  the  polynomial  preconditioned  FGMRES  discussed  next. 

Algorithm  2  Finite  element-based  domain  decomposition  method 

(1)  Discretize  the  domain; 

(2)  Partition  £1  into  T  non-overlapping 
sub-domains  in  terms  of  element; 

(3)  Compute  elemental  stiffness  matrix  Ke; 
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(4)  Compute  sub-domain  stiffness  matrix 
KW  from  Ke; 

(5)  Apply  boundary  condition  over  \  T ; 

(6)  Solve  the  system  of  equations  employing 
preconditioned  GMRES  solver; 


3.1  Data  structures  and  computational  kernels 

3.1.1  Data  structures  for  Matrix/vector  operations 

The  following  two  definitions  are  required  for  a  further  discussion. 

Definition  1  A  global  variable,  vector  u  or  matrix  K  in  a  local  distributed  format  is  denoted  by  u^  and  if  it  only 
contains  the  local  data. 

Definition  2  A  global  variable,  vector  u  or  a  matrix  K  in  a  global  distributed  format  is  denoted  as  u  W  and  KW  if  it 
contains  the  data  that  are  same  at  the  sub-domain  interfaces. 

These  definitions  are  explained  as  follows.  Consider  a  1-D  finite  element  truss  problem  made  of  two  elements  as 
shown  in  Figure  5(a).  It  follows  that  a  global  vector  u  can  be  formulated  as  global  distributed  vectors  (,v  =  1 , 2)  or 
local  distributed  vector  u^)  (,v  =  1,2).  In  addition.  Figure  5(b)  shows  that  the  global  distributed  vectors  store  the  full 
value  of  the  global  vector  u  with  respect  to  the  subdomain  s,  while  the  local  distributed  vector  contains  the  data  only 
attributed  to  the  local  elements  in  the  sub-domain  s.  It  should  be  noted  that  if  '-1  f  if  '1. 


Let  Bs  €  9vv'xA'  represent  an  unsigned  boolean  matrix  which  maps  the  global  vector  into  the  sub-domain  s.  It  follows 
that 

S’ 

uw=B.su;  and  u=£bJu(s)  (26) 

j=l 

A  local  distributed  vector  u^  can  be  converted  into  a  global  distributed  vector  through 


fiW  eeB.v£bJuW 

5=1 


(27) 


This  operation  can  be  efficiently  implemented  by  restricting  the  communication  context  within  the  neighboring  sub- 
domains  of  .s'  represented  as 

dQs 

u('s)  =1X1  £  uW  (28) 

where  d€ls  is  the  interface  of  the  sub-domain  s.  Here  we  refer  to  this  operation  as  a  nearest  neighbor  communication 
[6].  For  details  of  the  implementation  the  reader  is  referred  to  [6].  Analogous  to  vectors,  a  global  stiffness  matrix  K 
also  consists  of  a  series  of  global  distributed  sub-matrices  or  a  series  of  local  distributed  sub-matrices  k-d.  Again, 
referring  to  the  1-D  truss  problem  (Figure  5(a)),  the  global  stiffness  matrix  K,  the  local  distributed  stiffness  matrix, 
K(d  (s  =  1 , 2),  and  (s  =  1,2)  which  is  the  global  distributed  stiffness  matrix  are  given  by 


K 


kd) 

kd) 


0  -1  1 


1 

-T 

s 

II 

l 

-T 

/ 

-1 

l 

'  ~T 

-l 

l 

1 

-r 

*> 

s 

II 

2 

-T 

z 

-1 

2 

"  ~T 

-1 

l 

(29) 

(30) 

(31) 
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local  distributed  vector  global  distributed  vector  global  vector 


(a)  1-D  FEM  truss  problem  (b)  Local  and  global  distributed  vector 

Figure  5:  Data-structure  forEDD  strategy. 


where  /  is  the  element  length,  j?  is  cross  sectional  area,  and  £  is  Young’s  modulus.  It  is  clear  that,  in  the  general  case, 

a> 

K(s)  =  B[  KBs;  and  K  =  £  B[  K^Bs  (32) 

S=1 

Here  K  is  the  global  stiffness  matrix  for  a  domain  Q.  with  homogeneous  Neumann  boundary  conditions  on  the  inner 
boundaries  dQ.. 


3.1.2  Vector,  matrix- vector  and  preconditioning  solver  operations 

In  the  implementation  of  iterative  methods  for  the  solution  of  a  linear  system  of  equations,  four  basic  computational 
kernels,  namely,  the  vector  update,  inner-product,  matrix-by-vector  product  and  preconditioning  are  needed.  This 
section  discusses  their  efficient  implementation  for  element-based  domain  decomposition. 

Vector  operations 

Vector  updates  do  not  require  any  communication  and  the  left-hand-side  vector  can  be  always  determined  by  the  local 
quantities,  i.e.,  DAXPY  (y  <—  ax  +  y)  is  accomplished  via  yW  <—  ax^  +y^,s  =  1,2,  •  •  •  ,  £. 

The  inner  product  operation  (x,y)  where  x,y  G  9vv,  is 

(x,y)  =  (If=1BsxW,Ef=1BsyW) 

=  Lf=i(y(i))r(B[Lf=1B4w)  (33) 

=  Lf=i(y(s),xW) 

Similarly,  it  can  be  shown  that 

(x,y)  =  f  (xW,yW)  (34) 

s=i 

It  then  follows  from  Eq.  33  and  34  that  if  both  x  and  y  are  stored  in  a  local  distributed  format,  then  y  =  (x,y)  can  be 
computed  via 

(  (l)  yM=c<IansyW 

\  (2)  y(*)  =  <x«y«  (35) 

{  (3)  y  =  VLy'i) 

Matrix-by-vector  operation 
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The  matrix- vector  operation  y  <—  Kx,  where  x,y€  9vv  and  K  £  3iNxN,  is  the  only  stiffness-matrix-related  subroutine 
in  this  implementation.  Since 

Kx=(£  bJk«b5)(£  BjxW) 

*=1  S=1  (36) 

=  £  Bj  K(i¥J) 

i=t 

then  storing  y  in  a  local  distributed  format  such  that  y  =  Bj  y,,vi,  it  follows  that 

/*)  =  K(l¥s)  (37) 

Preconditioning 

The  application  of  the  preconditioning  is  similar  to  that  of  a  matrix- vector  product.  If  the  preconditioner  C,  is  available 
in  a  local  distributed  matrix  format,  C^,  then  C  =  (L.f=i  Bj  C(';iBv).  It  readily  follows  that,  the  preconditioning 
operation  can  be  written  as 

w  =  Cu  =  (£  B[C^B.s)(£  BjuW) 

S=1  S=1 

S=l  .5=1 

This  implies  that 

ww  =  C(l)B.v  £  B.fu(s)  =  Cwuw 

«=i 

Note  that,  it  is  assumed  that  a  preconditioner  is  available  in  each  processor  which  does  not  need  nearest  neighboring 
or  global  communication  for  its  construction. 


(38) 


(39) 


3.2  Parallel  implementation 

In  this  section,  the  parallel  implementation  of  the  diagonally  scaled  polynomial  preconditioned  flexible  GMRES  solver 
employing  the  previously  discussed  data  structures  and  computational  kernels  is  presented. 


3.2.1  Norm-1  Diagonal-scaling 


If  the  original  problem  is  partitioned  according  to  the  finite  element-based  domain  decomposition,  then  Eq.  2  in  terms 
of  the  local  distributed  format  yields 

|ukw||u«(^=u^  (40) 

The  partition  K  =  [k i ,  k2 ,  •  •  •  .  k.y  7  described  in  the  Theorem  1  in  terms  of  local  distributed  format  yields  K!v|  = 

■  k^'!  7  .  Let  norm-1  diagonal  scaling  matrix  D  be  formulated  in  the  global  distributed  format,  i.e.,  D  = 
U.f=i  D(,s).  Then 


where 


D(-s)  =diag(l/y//Jp,. 

i /\Z¥^) 

•  •  > 1  /  V  aNs  ) 

(41) 

df  =  «I4'’ 

Ns  f  ^ 

(42) 

Co 

II 

II 

H  4) 

j= i 

(43) 

Eqs.  41  and  43  can  be  readily  computed  using  the  following  algorithm. 
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Algorithm  3  Construction  of  Diagonal  Scaling 

(1)  FOR  0  =  1,2, ■ • ■  ,NS)  d,W  =  11^  ||i ; 

(2)  assign  $A  =  [d[s\-  ■  ■  ,d$s]T ; 

(3)  =[x] 

(4)  FOR  (i=  1,2,  - 

v" 


Employing  Algorithm  4,  the  norm-1  diagonal  scaling  matrix  D  can  transform  Eq.  40  into  a  linear  system: 


(44) 


where  AW  =  D^K^dW,  xW  =  D^uW  and  bW  =  DWfW.  The  polynomial  preconditioned  solvers  based  on  norm-1 
diagonal  scaling  is  described  by  the  algorithm  below: 


Algorithm  4  Application  of  Diagonal  Scaling 


(1)  AM  =  (6M)r£(s>6M; 

(2) 

(3)  Construct  preconditioning  polynomial  Pm 
with  0  =  (0, 1) ; 

(4)  Solve  (UfAW)(Ufx(l))  =  UfbW  using 
EDD-FGMRES  with  preconditioner  P„,( A); 

(5)  dW  =  f>M*M  ; 


Based  on  assumption  that  0  =  (0, 1 ),  a  parallel  polynomial  preconditioned  FGMRES  for  Eq.  44  is  formulated  and 
described  below. 


3.2.2  FGMRES  for  element-based  partitioning 

Following  Algorithm  1  and  the  computational  kernels  described  above,  the  finite  element-based  domain  decomposition 
flexible  GMRES  solver  for  the  system  of  equations,  Eq.  44,  can  be  readily  derived  which  is  listed  in  Algorithm  5. 

Algorithm  5  Restarted,  polynomial  preconditioned  (Pm)  EDD-GMRES: 

(1)  PARTI:  Initialization 

G\ 

(2)  choose  Xq  , 

m  (dimension  of  the  Krylov  subspace) ; 
convergent  =  FALSE; 

(3)  PART2:  Arnoldi  process 

/*  SECTION  2.1:  Formulate  '|js)  */ 

(4>  =M£afli4l); 

(5)  fW=bW-AWw0; 

(6)  w<J)  =m£3£*g£); 

(7)  p  =  (VEsl(w0,v'f)»L 

(8)  v's)=4j)/P; 

/*  SECTION  2.2: 

Classical  Gram-Schmidt  process  */ 

(9)  FOR  y  =  0, •  •  •  ,m  —  1  DO 

(10)  /*  preconditioning  Algorithm  7  */ 

iW  «-Pm(AW)fW 

(11)  w<‘>  =lxiE3n»z(s); 

(12)  /*  matrix-vector  product  */ 

=AWwW 

(13)  w0=m£  acisy{jll-, 

(14)  FOR  0  =  0 Aj=V£fl(w0,ff)); 

(15)  FOR  0  =  0,. ..J)  ; 
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(16) 

wM  =m£31W^i; 

(17) 

/>,+u  =  (VE^tp*(s)»L 

/*  hj+ij  =  Hv'ti  lb  */ 

(18) 

IF  hj+ij  <  £  THEN 

(19) 

convergent  =  TRUE;  BREAK; 

(20) 

END  IF 

(21) 

/*  normalize  */ 

✓  (s)  ,{s)  /, 

(22) 

ENDFOR 

/*  SECTION  2.3:  Formulate  Krylov 
subspace  (spanned  by  {z j}™=\) 
and  Heisenberg  matrix  H,,,*/ 

(23)  Define  =  [z^ , •  •  •  ,z^]  and 

H j  =  {hiyk}  (0  <  /  <  k+  1,0  <  k  <  j) ; 

(24)  PART3:  Compute  the  approximate  solution: 

(25)  =  Xq5)  +  yj  such  that 
yj  =argminy\\$ei  -Hy-y||2; 

(26)  PART4:  Restart: 

(27)  IF  (not  convergent)  THEN 

(28)  x(0s)  =  if  ;  GOTO  (3)  ; 

(29)  END IF 

A  typical  Arnoldi  iteration  consists  of  'in  Gram-Schmidt  loops  where  "in  is  the  dimension  of  the  Krylov  subspace 
Therefore  for  “ m  Gram-Schmidt  computations,  there  are  0(  fanner-products  to  form  the  Heisenberg  matrix  H  -m  A 
naive  implementation  will  result  in  0(  "in)  collective  communications.  This  overhead  will  increase  as  the  number  of 
processors  increases  as  the  collective  communication  overhead  is  0  (  log  1’  ).  Therefore,  an  efficient  implementation  is 
required  which  is  achieved  by  restricting  this  overhead  to  0(  'in)  collective  communications  in  statement  14  of  Algo¬ 
rithm  5. 

In  the  Algorithm  5,  for  each  Arnoldi  iteration  in  the  statements  9-24,  there  exist  three  nearest  neighbor  communication 
operations  ix  £',a’  namely,  statements  13,  15  and  18.  In  order  to  enhance  the  parallel  scalability  of  the  Algorithm  5 
further,  these  operations  should  be  minimized.  This  can  be  achieved  by  introducing  preconditioning  vectors  z  in  the 
global  distributed  format  instead  of  the  basis  vectors  in  the  global  distributed  format.  This  reduces  the  nearest  neighbor 
communication  operation  to  one  instead  of  three.  The  details  of  this  enhanced  parallel  GMRES  solver  is  described  in 
Algorithm  6. 

Algorithm  6  Enhanced  EDD-FGMRES  with  restart  and  preconditioner  Pm: 

(1)  PARTI:  Initialization 

fA 

(2)  choose  xX  , 

m  (dimension  of  the  Krylov  subspace) ; 

convergent  =  FALSE; 

(3)  PART2:  Arnoldi  process 

/*  SECTION  2.1:  Formulate  VqS)  */ 

(4)  fW=bM-AMiW; 

(5)  f W  =1X1  ; 

(6)  P  =  (V£n<v<j),v<j)»L 

(7)  vW=v<s)/p 

/*  SECTION  2.2: 

classical  Gram-Schmidt  process  */ 

(8)  FOR  j  —  -  DO  until  convergence 

/*  polynomial  preconditioning  Algorithm  7  */ 

(9)  if  <-Pm(AW)f^); 

(10)  if  =ixi  £3£is  zf  ; 

(11)  /*  matrix-vector  product  */ 

vfl=A^zf ; 

(12) 


13 


(13) 

FOR  G  =  0 ,...J)  */,;  =  VLn(fWlrf© 

(14) 

FOR  (;  =  0,...J)  DO 

(15) 

V©  -y©  -fo  v©- 

yj+l  ~~  yj+ 1  n*Jvi  ’ 

(16) 

V©  -v©  —h  v©  • 
yj+ 1  —  yj+ 1  n‘:jyi  ’ 

(17) 

ENDFOR 

/*  Gram-Schmidt  process  */ 

(18) 

/q,,./i(VL“{v©,  .V©,))’; 

(19) 

IF  hj+ij  <  £  THEN 

(20) 

convergent  =  TRUE;  BREAK; 

(21) 

ENDIF 

(22) 

tfh  =yjii/hj+i,j’ 

(23) 

ENDFOR 

/*  SECTION  2.3: 

Formulate  Krylov  subspace 

(spanned  by  {z ;}£=1) 

and  Heisenberg  matrix  H j  */ 

(24) 

Define  =  [z^ , •  •  •  , z^ ]  and 

H;  =  {hj,k}  (0<i<k+l,0<k<j); 

(25)  PART3:  Compute  the  approximate  solution: 

(26)  xf^x©+Z©y, 

such  that  y;- =  ar^miny ||P^i  —  Hyy||2 ; 

(27)  PART4:  Restart: 

(28)  IF  (not  convergent)  THEN 

(29)  x^  =  X© ;  GOTO  (3)  ; 

(30)  ENDIF 


3.2.3  Polynomial  Preconditioning 

The  necessary  polynomial  preconditioning  operation,  in  the  Algorithm  5, line  10,  and  Algorithm  6,  line  10,  the 
Neumann-series  polynomial  preconditioning  is  described  below.  The  application  of  the  generalized  least-squares 
polynomial  preconditioning  (whose  construction  is  discussed  in  [15]  is  formulated  in  a  similar  manner. 

Algorithm  7  EDD  Neumann  Series  Polynomial  Preconditioning  rJs>  <—  Pm( A^)v^  .  m  and  ©  =  (0,  ~h)  are  input 
parameters. 

(1)  CO  =i; 

(2)  =cxiLan'vW; 

(3)  z ©)  =  f M  —  ©A^w^ ; 

(4)  FOR  (i  =  1,2,---  ,m)  DO 

(5)  z(«)  =  z(s)  _)_  yW  ; 

(6)  ffW  =cx]£ao*z(s); 

(7)  =  z^  —  CoA^w^) ; 

(8)  ENDFOR 

(9)  z(s)  =  co(v^)  +z^) ; 


Alternative  preconditioning  methods  such  as  ILU  and  SPAI  may  face  drawbacks  for  finite  element  based  domain 
decomposition  methods.  For  example,  the  incomplete  ILU  preconditioner  in  the  local  distributed  format  computed  via 

(C^r1  =  L(l)U(s)  =  K(i)  (45) 

may  occasionally  suffer  from  singularity  of  K(©.  In  structural  mechanics,  if  sub-domain  s  does  not  contain  sufficient 
number  of  Dirichlet  boundary  conditions  such  that  it  “floats”,  then  K!'v)  will  be  singular. 
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(a)  Finite  element-node  based  partitioning 


(b)  Associated  matrix  row  partition 


Figure  6:  Storage  format  of  row-base  domain  decomposition  linear  systems. 


4  Row-based  Partition  Strategy 

For  the  finite  element  method,  if  a  node-based  partitioning  is  employed  (see  Fig.  6(a)),  then  the  associated  matrix 
partitioning  is  a  block  row  partitioning  (see  Fig.  6(b)).  Efficient  parallel  implementation  of  iterative  solvers  based 
on  block  row  partitioning  of  the  matrix  are  implemented  in  many  popular  libraries  such  as  such  as  pARMS  [19], 
PSPARSLIB  [18],  Aztec  [20],  etc.  To  compare  these  implementations  with  the  finite  element  based  domain  decompo¬ 
sition  based  GMRES  described  previously,  the  row -based  partitioning  implementation  is  presented  in  this  section. 

4.1  Data  structures  and  computational  kernels 

4.1.1  Data  structures 

Row-based  domain  partitioning  provides  a  solution  method  for  a  general  purpose  linear  system  of  equations.  Using 
specific  graph  methods  [21],  the  degrees  of  freedom  (d.o.f)  and  associated  equations  (rows  of  the  stiffness  matrix)  are 
partitioned  and  assigned  to  different  processors.  Referring  to  Figure  6,  let  the  original  domain  be  partitioned  into  2’ 
nodally  disconnected  sub-domains.  Then 

K(i)  =  B,.K;  and  uW  =  B.vu  (46) 

It  is  clear  from  the  above  equations,  that  the  matrix  and  vectors  resulting  due  to  finite  element  node -based  partitioning  is 
the  row  partitioning  of  the  global  matrix  and  vectors.  Hence  the  matrix  and  vectors  are  always  in  the  global  distributed 
format. 

4.1.2  Vector,  matrix- vector,  preconditioning  operations 

Similar  to  the  element-based  partitioning,  the  node-based  partitioning  also  requires  three  basic  operations,  namely, 
vector,  matrix-vector  product,  and  preconditioning  operations. 

Vector  operations 

As  in  the  element-based  algorithm,  the  vector  update  operations  for  the  RDD  strategy  do  not  require  any  interprocessor 
communications.  The  inner  product 

<x,y>  =  I<s(,),yw>  (47) 

£=1 
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(a)  Adjacency  of  R-DD  method 


(b)  Matrix  data  structure  of  the  sub-domain 


Figure  7:  Structures  of  the  finite  element  row -based  partitioning 


is  accomplished  by  a  local  inner  product  operation  in  each  processor  followed  by  a  collective  all  reduce  communica¬ 
tion. 


Matrix-vector  operations 

The  Matrix-vector  product  operation  of  row-based  partitioning  is  relatively  more  complex  than  that  of  element-based 
partitioning  because  the  local  d.o.f  and  the  associated  equations  need  to  be  de-coupled. 

Figures  7(a)  and  7(b)  distinguishes  three  types  of  DOFs  for  row-based  linear  systems  into  interior  DOFs  (x [*',),  internal 
(s)  (s) 

interface  (xLT)  and  external  interface  DOFs  (x,,A-,)  [7, 19].  Further,  as  illustrated  in  Figure  7(a)  and  7(b),  the  local 

equations  are  also  de-coupled  into  two  components,  namely,  the  local  matrices  K^j,  and  K.J,' j .  The  matrix-vector 
product  operation  y  <—  Kx  can  be  accomplished  by 


scatter  x^d  to  neighboring  sub  —  domains', 
gather  XgAj  from  neighboring  sub  —  domains', 

y{s)  =  Kiocsioc’ 

yW=yM+K^Xrt; 


(48) 


It  should  be  noted  that  a  re-ordering  of  local  DOFs  is  required  in  order  to  achieve  satisfactory  parallel  performance. 

Precondition  operations 

In  [19],  additive  Schwartz,  Schur  complement  and  ILU  methods  are  employed  as  preconditioners  for  a  row-based 
linear  system  .  These  methods  are  actually  the  extensions  of  the  block  Jacobi  method  [7]  whose  kernel  is  to  solve  the 
local  system 


In  addition,  an  effective  implementation  of  the  GLS  polynomial  preconditioning  is  critically  assessed  in  [15],  In  the 
same  manner  as  the  EDD  strategy,  after  employing  the  norm-1  diagonal  scaling  as  a  pre-process,  the  original  linear 
system  is  transformed  into 


(49) 


Here,  it  should  be  remarked  that  the  norm-1  diagonal  scaling  for  the  RDD  strategy  requires  no  inter-processor  com¬ 
munication,  while  its  matrix-vector  product  for  the  RDD  strategy  may  incur  more  communication  overhead  than  the 
EDD  strategy. 
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4.1.3  FGMRES  for  row-based  partitioning 


Following  Algorithm  1  and  the  computational  kernels  described  above,  the  finite  row-based  domain  decomposition 
flexible  GMRES  solver  to  solve  the  system  of  equation  Eq.  49  can  be  readily  derived  which  is  listed  in  Algorithm  8. 

Algorithm  8  RDD-FGMRES  with  restart  and  polynomial  preconditioner  Pm: 

(1)  PARTI:  Initialization 

_(s) 

(2)  choose  Xq  , 

m  (dimension  of  the  Krylov  subspace) , 
setup  a  (m+l)-by-m  matrix  H. 

(3)  PART2:  Arnoldi  process 

/*  SECTION  2.1:  Formulate  VqS)  */ 

(4)  =bW  -MatVec(k^\^)-,  (Eq.  48) 

(6)  P  =  (VIn{v<s),v(j)))i; 

(7)  yf  =  VqS)/P; 

/*  SECTION  2.2: 

Classical  Gram-Schmidt  process  */ 

(8)  FOR  7  =  0, •••  ,m  —  1  DO 

(9)  /*  polynomial  preconditioning  */ 

if  =Pm(Ms))yfi 

(10)  /*  matrix-vector  product  */ 

=  MatVec(A$ ,  ) 

(ID  for  0  =  0,.. .,f)  hiJ=Vt1{vfvyf)-, 

(12)  /*  Gram-Schmidt  process  */ 

FOR  0  =  0,..., j) 

G3)  /*  h]+1, 3  =  ||v^ilb  */ 

(14)  IF  hj+ij<£  THEN 

convergent  =  TRUE;  break; 

ENDIF 

(15)  vf-t—ifl/hj+lji  /*  normalize v^j  */ 

(16)  ENDF0R 

/*  SECTION  2.3: 

Formulate  Krylov  subspace 
(spanned  by  {zjYj=1) 

and  Heisenberg  matrix  Hj*/ 

(17)  Define  zf  =  [if,- ■  ■  ,if]  and 

H,„  =  {/i;,A-}  (0  <  i  <k+  1,0  <  k  <  j) ; 

(18)  PART3:  Compute  the  approximate  solution: 

(19)  xf  +zfyj  such  that 

Yj  =  orgmin^lfflei  -Hyylb; 

(20)  PART4:  Restart: 

SW  _SM. 

X0  “  X7  ’ 

(21)  IF  (not  convergent)  THEN  GOTO  (3) ; 

_/A 

/*  Xq  stores  the  final  solution  */ 


5  Comparison  of  finite  element  and  row-based  domain  decomposition 

The  computational  complexity  of  each  inner  Arnoldi  process  of  the  GMRES  solver  for  both  element -based  and  row- 
based  strategies  are  given  in  Table.  1 .  Asymptotically,  both  algorithms  have  the  same  parallel  run  time  Tp;  making  both 
algorithms  equally  scalable  on  a  wide  range  of  parallel  architectures  for  finite  element  analysis.  However,  the  actual 
performance  difference  between  the  two  algorithms  depends  upon  the  actual  constants  associated  with  complexity  of 
communication  overhead  for  the  node  and  element  grid  partitioner  on  the  same  finite  element  grid. 

In  both  algorithms,  the  main  operations  are  dAXPY  operations,  vector  inner-products  computations  and  matrix- 
vector  multiplication.  The  dAXPY  operations  do  not  involve  any  communication  overhead.  The  communication  time 
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Method 

Neighbor  comm. 

Global  comm. 

mat-vec 

Inner-product 

vector-updates 

Algorithm  5 

m+3 

”m  + 1 

m+1 

0(  “m) 

0(  “m  +  m) 

Algorithm  6 

m+1 

”m  + 1 

m+1 

0(  “ m ) 

0(  “m  +  m) 

Algorithm  8 

m+1 

”m  4- 1 

m+1 

0(  “m) 

0(  ”m  +  m) 

Table  1 :  Communication  and  computation  cost  of  inner  Arnoldi  process  of  GMRES  solver  (  “m  is  the  size  of  Krylov 
subspace) 


per  iteration  for  a  vector  inner  product  depends  only  on  the  number  of  processors  in  use.  This  time  is  O  (log  1’ )  on  the 
hypercube  and  the  HiPPI  switch  based  architecture,  and  o{s/r)  on  a  mesh  based  architecture.  In  the  presence  of  a 
fast,  hardware  supported  reduction  operation  we  can  assume  that  it  is  a  small  constant.  Hence  the  only  performance 
between  element-based  and  row-based  algorithms  lies  in  matrix-vector  product.  For  element-based  matrix-vector 
multiplication,  the  communication  only  happens  among  the  nearest  neighboring  processes;  for  row-based  matrix- 
vector  multiplication,  more  processes  will  get  involved  in  communication. 


Figure  8:  Element  assignment  to  different  processors  for  node  based  partition  mesh. 

In  finite  element  analysis  for  an  unstructured  finite  element  mesh  the  resulting  coefficient  matrix  is  sparse  with 
an  irregular  sparsity  pattern.  Earlier  work  in  sparse  parallel  matrix- vector  multiplication  [22]  concluded  that  the  effi  - 
ciency  of  such  algorithms  depends  on  the  data  structure  used  and  the  sparsity  pattern  of  the  matrix  K.  Furthermore, 
a  scalable  parallel  implementation  of  a  sparse  matrix-vector  multiplication  exists  for  a  sparse  matrix  K  provided  that 
it  is  the  adjacency  matrix  of  a  planar  graph  1  G(K)  [21],  Since  row-based  GMRES  algorithms  require  efficient 
parallel  matrix-vector  multiplication,  both  the  search  direction  computation  and  the  polynomial  preconditioning  com¬ 
putation  which  increases  with  the  degree  of  polynomials  for  higher  order  elements  such  as  a  4-noded  quadrilateral  and 
the  8-noded  quadrilateral,  etc.,  it  makes  G(K)  non-planar  and  thus  deteriorates  the  scalability  of  parallel  implemen¬ 
tation.  Furthermore,  in  finite  element  analysis,  in  contrast  to  the  element-based  algorithm  the  row-based  algorithm 
has  the  following  drawbacks.  Referring  to  Figure  8,  which  shows  elements  assigned  to  each  processor  for  the  mesh. 
Figure  8(a)  shows  an  example  which  is  partitioned  into  three  subdomains.  Note  that  all  the  elements  sharing  the  inter¬ 
face  nodes  must  be  assigned  to  the  processors  to  avoid  the  assembly  of  interface  contribution  to  the  global  matrix  K 
which  increases  communication  overhead  due  to  gathering  of  entries  of  K  from  the  neighboring  processors  resulting 
in  duplicate  assignment  of  the  elements.  This  leads  to  two  further  drawbacks: 

1.  For  large  finite  element  meshes  and  especially  for  three-dimensional  problems  the  computer  storage  require¬ 
ments  may  increase  drastically; 

2.  This  will  incur  significant  redundant  floating-points  operations  for  all  the  nodes  assigned  to  the  processors  which 
are  part  of  extra  elements  assigned  and  are  not  part  of  the  original  nodes  in  that  partition. 

It  is  clear  from  the  above  discussion  that  element  domain  decomposition  based  GMRES  algorithms  seem  to  be  suitable 
for  general  parallel  finite  element  analysis,  especially  for  unsymmetric  matrices. 

1 A  graph  is  planar  if  and  only  if  it  can  be  drawn  in  a  plane  such  that  no  edges  cross  each  other.  In  finite  element  analysis,  G(K)  is  planar  for  a 
3-noded  triangular  element. 
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Figure  9:  Problem  description. 


6  Numerical  Results 

In  this  section  numerical  results  for  two  dimensional  2nd  order  elasticity  and  elastodynamics  problems  employing  the 
present  parallel  polynomial  preconditioners  are  presented. 


6.1  Problem  description 

Employing  the  finite  element  method  for  solving  2nd  order  elasticity,  results  in  the  discrete  equation  system 


Ku  =  f, 


(50) 


where  K  £  <RNxN  js  the  global  stiffness  matrix,  u  £  9i  v  and  f  £  9iv  are  the  displacement  and  externally  applied  load 
vectors  respectively,  and  N  is  the  number  of  degrees  of  freedom.  Similarly,  employing  finite  element  formulation  on 
the  elastodynamics  equations,  results  in  following  semi -discrete  equation  system 


u 

M-^2-  +Ku  =  f;  u(0)=uo;  u  =  uo, 


(51) 


where  M  £  9iVx,v  is  a  positive  definite  mass  matrix.  Observe  that  Equation  51  is  a  2ul  order  ordinary  differential 
equation.  Once  time  marching  techniques  such  as  the  family  of  generalized  integration  operators  [6]  are  employed  for 
time  discretization,  it  results  in  a  system  of  linear  equations  of  the  form 


Mu„+i  =  [aM  +  (3K]  u„+i  =  f„+i , 


(52) 


where  M  is  the  effective  stiffness  matrix,  a  and  [j  are  free  parameters  associated  with  specific  time  integration  method, 
and  n  +  1  is  the  current  time  step.  For  illustration,  the  Eqs.  50  -  52  are  solved  for  a  two-dimensional  cantilever  problem 
described  in  Fig.  9(a)  employing  four-node  quadrilateral  finite  elements.  The  parallel  flexible  GMRES  algorithms 
described  previously  are  implemented  using  C  and  MPI.  The  finite  element  meshes  of  increasing  problem  size  are 
shown  in  Table  2  where  nXelexnYele  describes  the  mesh  dimension  in  the  unit  of  element,  nNode  is  the  total 
number  of  nodes  in  the  mesh  and  nEqn  is  the  degrees  of  freedom.  Since  all  systems  undergo  diagonal  scaling  prior 
to  computation  of  the  solution,  ©  can  be  simply  defined  as  0  =  (e,  1)  where  e  is  the  machine  precision.  The  value 
of  “m  (refer  to  Algorithm  1),  the  dimension  of  the  Krylov  subspace  is  chosen  to  be  25  and  convergence  is  achieved 

when  ttt-  <  tol,  where  r,  is  the  ;-th  residual  vector  (b  —  Ax,).  The  value  of  tol  =  10“6  is  chosen  in  the  numerical 

llr0]|2  ~  '  ' 

experiments. 


6.2  Results 

The  numerical  experiments  focus  on: 
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Mesh 

nXelexnYele 

nNode 

nEqn 

Meshl 

7x1 

16 

28 

Mesh2 

40  x  8 

369 

656 

Mesh3 

40x20 

861 

1,640 

Mesh4 

50x50 

2,601 

5,100 

Mesh5 

60x60 

3,721 

7,320 

Mesh6 

70x70 

5,041 

9,940 

Mesh7 

80x80 

6,561 

12,960 

Mesh8 

90x90 

8,281 

16,380 

Mesh9 

100  x  100 

10,201 

20,200 

Mesh  10 

200  x  100 

20,301 

40,400 

Table  2:  Finite  element  mesh  of  increasing  number  of  elements  and  nodes  for  a  cantilever  problem  which  is  employed 
for  the  parallel  performance  studies  for  both  EDD-  and  NDD-GMRES. 


1.  Comparison  of  polynomial  and  ILU(O)  (ILU  without  any  fill-in)  [7]  preconditioners; 

2.  Convergence  improvement  and  the  degree  of  polynomial  preconditioners; 

3.  Parallel  speedup  of  polynomial  preconditioned  EDD-FGMRES. 

It  should  be  remarked  that,  ©,  the  spectrum  estimation  of  the  diagonally  scaled  linear  system  (1 1)  is  always  regarded 
to  be  (0, 1 )  in  the  following  experiments.  However,  as  illustrated  in  Figure  10,  making  ©  =  (0,1)  does  not  necessarily 
lead  to  optimal  convergence  performance. 


Figure  10:  Convergence  of  EDD-GMRES-gls(lO)  versus  the  estimation  about  a  (A)  (©). 


Polynomial  Preconditioner  vs.  ILU(O)  ILU  preconditioner  is  considered  to  be  one  of  the  most  efficient  (from  the 
point  of  view  of  convergence  iteration)  sequential  preconditioning  method  [7].  While  Figures  1 1  and  12  illustrate  that, 
in  the  context  of  single  processor, 

GLS(1)  >-  ILU (0)  >-  Neum( 20), 

holds  for  both  static  and  dynamic  problems  of  Meshl  and  Mesh.2.  Here  indicates  “converge  faster  than”,  and 
the  degree  of  polynomials  are  given  bracket.  As  a  conclusion,  the  performance  of  the  GLS  and  Neumann  series 
polynomial  preconditioners  is  completely  comparable  to  the  ILU(0)  preconditioner  on  a  single  processor. 
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Here  it  should  be  remarked  that  two  small-scale  systems  Meshl  and  Mesh2  are  chosen  to  study  the  convergence 
of  solvers  because  ILU(O)  does  not  work  for  a  large-scale  problem  in  element-based  domain  decomposition  setting. 
Compared  to  polynomial  preconditioners,  the  ILU  requires  more  storage. 


Preconditioned  GMRES  for  DD  Cantilever  Beam  Preconditioned  GMRES  for  Domain  Decomposition 


(a)  7  x  1  (b)  40  X  8 

Figure  11:  ILU(O)  versus  polynomial  preconditioner  for  static  analysis  about  cantilever  beam  with  pulling  load  prob¬ 
lem. 


Degree  of  Polynomial  and  The  Resulting  Convergence  Performance  From  both  the  Figures  13  and  14,  it  is 
observed  that  for  both  the  static  and  dynamic  problems  of  Meshl  and  Mesh2, 

GLS{ 20)  >-  GLS(10)  >-  GLS(1)  >-  GLS{ 3)  >-  GLS(l), 

which  implies  that  a  higher  degree  GLS  polynomial  preconditioning  may  bring  about  more  reduction  of  the  iteration 
number  for  convergence.  However,  this  is  not  be  true  for  some  relatively  large  problems.  For  example,  as  illustrated 
in  Table  3,  GLS(7)  performs  better  than  GLS(8),  GLS(9)  and  GLS(10)  for  the  static  problem  of  Mesh7. 

Parallel  Speedup  The  results  presented  in  Figure  17  and  Table  3  show  that  the  polynomial  preconditioned  EDD- 
FGMRES  achieves  an  acceptable  speedup  when  multiple  processors  are  used  on  both  the  IBM-SP2  and  the  SGI- 
ORIGIN  machine.  Four  important  observations  can  be  made.  First,  the  parallel  speedup  of  the  polynomial  precon¬ 
ditioned  FGMRES  is  related  to  the  degree  of  the  preconditioning  polynomial.  Trivially,  the  higher  the  degree  of 
the  preconditioning  polynomial,  the  more  dominant  the  matrix-vector  product  will  become  in  the  whole  problem. 
Figure  17  (a)  shows  that  for  the  same  kind  of  problem,  EDD-FGMRES-GLS(ot)  has  better  parallel  speedup  than 
EDD-FGMRES-GLS(n)  if  m  >  n.  In  contrast.  Figure  17  (b)  shows  that  the  value  of  m  does  not  influence  the  parallel 
speedup  of  the  RDD-FGMRES-GLS(w)  significantly.  Secondly,  it  is  observed  via  Figures  17  (c)  and  (d)  that,  with 
the  increase  in  the  dimension  of  the  problem,  the  parallel  performance  achieved  will  approach  a  higher  level  of  linear 
speedup.  This  phenomenon  is  also  illustrated  in  Table  3.  Thirdly,  as  shown  in  Table  3,  even  though  the  GLS(10) 
has  better  convergence  performance  than  the  GLS(7),  the  EDD-FGMRES-GLS(IO)  has  a  significant  time-cost  than 
the  GLS(7)  because  in  each  iteration  the  GLS(10)  performs  three  additional  matrix-vector  product  operations  than 
GLS(7).  Therefore,  a  trade-off  between  convergence  performance  and  CPU  time  should  be  made. 

From  Figure  17(e)  it  is  clear  that  our  implementation  is  also  highly  portable.  However,  the  parallel  speedup 
depends  on  the  HPC  system  in  question.  Figure  17(e)  shows  that  the  SGI-ORIGIN  has  a  better  parallel  performance 
than  the  IBM-SP2,  which  can  be  attributed  to  the  shared  memory  architecture  of  the  SGI-ORIGIN  machine  as  against 
the  distributed  memory  architecture  of  the  IBM-SP2.  In  the  case  that  only  a  small  number  of  processors  participate 
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Preconditioned  GMRES  for  Domain  Decomposition  Preconditioned  GMRES  for  Domain  Decomposition 


(a)  7  x  1  (b)  40  x  8 

Figure  12:  ILU(O)  versus  polynomial  preconditioner  for  dynamic  analysis  about  cantilever  plate  with  pulling  load 
problem. 


Preconditioned  GMRES  for  Domain  Decomposition  Preconditioned  GMRES  for  Domain  Decomposition 


(a)  7  x  1  (b)  40  x  8 

Figure  13:  Convergence  versus  increasing  degree  of  preconditioning  polynomial  for  static  analysis  of  cantilever  plate 
problem. 
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Preconditioned  GMRES  for  Domain  Decomposition 


Preconditioned  GMRES  for  Domain  Decomposition 


(a)  7  x  1  (b)  40  x  8 

Figure  14:  Convergence  versus  increasing  degree  of  preconditioning  polynomial  for  dynamic  analysis  about  cantilever 
plate  problem. 


in  the  computation,  the  IBM-SP2  has  a  higher  communication  overhead  between  the  computing  nodes  than  the  SGI- 
ORIGIN. 

7  Conclusions 

In  this  paper  an  efficient  parallel  finite  element-based  domain  decomposition  GMRES  solver  with  polynomial  precon¬ 
ditioning  is  presented. 

•  For  single  processor  computations,  the  convergence  performance  (in  terms  of  the  number  of  iterations)  of  the 
polynomial  preconditioned  methods  are  comparable  with  the  ILU(O)  preconditioned  ones.  Further,  for  finite 
element  based  domain  decomposition  computations,  polynomial  preconditioner  is  a  preferable  choice  as  it  is 
cheaper  to  construct  than  the  ILU(O)  preconditioner. 

•  The  finite  element  based  domain  decomposition  solver  in  conjunction  with  the  polynomial  preconditioning  (such 
as  the  Neumann  series,  GLS,  etc.)  circumvents  the  following  operations: 

-  the  assembly  process  for  associated  coefficient  matrix  and  preconditioner; 

-  reordering  of  degree  of  freedom; 

-  redundant  computations  associated  with  the  interface  elements; 

-  numerical  problems  associated  with  local  ILU(k)  preconditioner. 

This  is  unlike  in  the  case  of  standard  row-oriented  partitioning  strategies.  As  a  consequence,  a  dramatic  re¬ 
duction  in  parallel  overhead  both  in  terms  of  computations  and  communications  is  achieved.  The  parallel  per¬ 
formance  results  for  illustrative  large-scale  static/dynamic  problems  on  the  IBM  SP2  and  the  SGI  Origin  were 
presented. 
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(a)  EDD  with  increasing  m 


(b)  RDD  with  increasing  m 


Figure  15:  Parallel  speedup  results  employing  polynomial  preconditioned  FGMRES  with  increasing  degree  of  poly¬ 
nomials. 
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Table  3:  Performance  results,  iterations,  total  CPU  time  (sec)  and  speedup  (S)  for  FGMRES-GLS(m)  for  the  static 
problem  on  SGI-Origin. 
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(a)  For  increasing  problem  size  (b)  IBM  SP  and  SGI  Origin 

Figure  16:  Parallel  speedup  results  employing  polynomial  preconditioned  FGMRES. 
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